Accessing data:

Motivations

With the development of mobile devices and internet, the effects of social networks has grown increasingly in recent years. We would like to use real world data from social network applications to investigate interesting things in our daily life. We love to explore good food around us in our spare time and used to use Yelp as a searching engine for restaurants. Actually, more and more people use Yelp as a consumer reviews site for finding out top recommendations of restaurants and other services. Thus, we decide to use data from Yelp to predict the success of restaurants. In this project, we included only the restaurants in Pennsylvania (PA) and Quebec (QC) states because we want to compare two states in the North East of America and Canada. We want to study this data and explore the insights of restaurants’ success near us.

Investigating Questions:

Let’s start our project from data wrangling.

Data Wrangling

Load our RData files.

# set working directory
# Loading RData
library(dplyr)
library(tidyr)

y<-load("business.RData")
rm(y)

r <-load("review.RData")
rm(r)

We check the format of each column in the data frame, and we find out that there are lists and data frame inside the data frame of business and review. For example, business categories is the list type, such as c(“Chinese”,“Restaurants”), and attribute is a data frame. For the review data, the votes column is a data frame that consists of votes.funny, votes.useful, and votes.cool.

# Data wrangling
lapply(business, class)
## $business_id
## [1] "character"
## 
## $full_address
## [1] "character"
## 
## $hours
## [1] "data.frame"
## 
## $open
## [1] "logical"
## 
## $categories
## [1] "list"
## 
## $city
## [1] "character"
## 
## $review_count
## [1] "integer"
## 
## $name
## [1] "character"
## 
## $neighborhoods
## [1] "list"
## 
## $longitude
## [1] "numeric"
## 
## $state
## [1] "character"
## 
## $stars
## [1] "numeric"
## 
## $latitude
## [1] "numeric"
## 
## $attributes
## [1] "data.frame"
## 
## $type
## [1] "character"
lapply(review,class)
## $votes
## [1] "data.frame"
## 
## $user_id
## [1] "character"
## 
## $review_id
## [1] "character"
## 
## $stars
## [1] "integer"
## 
## $date
## [1] "character"
## 
## $text
## [1] "character"
## 
## $type
## [1] "character"
## 
## $business_id
## [1] "character"

Find all restaurants in QC and PA

Because the business categories is the list, we could not apply the function filter in dplyr, we decided to find all the indexes of restaurants in PA and QC states. Then, we match restaurant index ID with business ID. We subset the business data frame with matching business restaurant ID. For the review data, we find review restaurant ID indexes by the matching business ID. Then, we filtered the review data frame by matching the review ID with the review restaurant indexes.

#table(business$state)

#Filter Restaurants in categories and states PA,QC (states with more than 1000 restaurants)

#getting the index
business_restaurant_index <- which(grepl("Restaurants",business$categories) & business$state %in% c("QC","PA"))

#getting business id by index
business_restaurant_id <- business$business_id[business_restaurant_index]

review_restaurant_index <- which(review$business_id %in% business_restaurant_id) 

review_restaurant_id <- review[review_restaurant_index,"review_id"]

# Filter review by review_id
review <- review[review$review_id %in% review_restaurant_id,]

#Filter business by business_id
business <-business[business$business_id %in% business_restaurant_id,] 

Maps of two states

Before we manipulate more variables, we want to see what the location distributions of the restaurants in this two states. Now we show the restaurants distribution on the maps for PA and QC, and find out that almost all the restaurants in Yelp dataset are around Pittsburgh, the second largest city in Pennsylvania, and Montreal, the largest city in Quebec. We can conclude that there are more restaurants in the big cities, and people are in the large cities would more likely to give the ratings and share their comments on the Yelp.

# Map of all restaurants
library(ggmap)
mapsData <- business %>% select(state, longitude, latitude)

mapPA <- get_map("pennsylvania", zoom = 7)
PA <- ggmap(mapPA)+
  geom_point(data = mapsData %>% filter(state == "PA"), aes(x = longitude, y = latitude), color = "#CC6666") 
PA

#ggsave("PA.png", PA)

mapPitts <- get_map("pittsburgh", zoom = 12)
pitts <- ggmap(mapPitts)+
  geom_point(data = mapsData %>% filter(state == "PA"), aes(x = longitude, y = latitude, alpha = 1/20), color = "#CC6666") 
#ggsave("pitts.png", pitts) 
pitts

mapQC <- get_map("quebec", zoom = 5)
QC <- ggmap(mapQC)+
  geom_point(data = mapsData %>% filter(state == "QC"), aes(x = longitude, y = latitude), color = "#6666CC") 
#ggsave("QC.png", QC)
QC

mapMont <- get_map("montreal", zoom = 11)
montreal <- ggmap(mapMont)+
  geom_point(data = mapsData %>% filter(state == "QC"), aes(x = longitude, y = latitude, alpha = 1/20), color = "#6666CC")
#ggsave("montreal.png", montreal)
montreal

Select and manipulate variables

Now we specify some kind of foods and add them into dataset.

We want to create columns of binary variable for each food cuisine and fast food. The kind of food information is also contained in the business categories column of business data frame. Thus, we have to overcome with the trouble of filtering it. First, we add new columns of Mexican, Fastfood, American,Asian,and European into business data frame. Next, we creat temporary dataframes, and loop through all lists in the business categories column. Then, we use ifelse function to check each condition and assign 1 or 0 to each temporary dataframe. Finally, we replace the empty column with each column of the new dataframes.

# Mexican, fast food, American, Asian, Europeans, Latin Americans

business$Mexican=NA
business$Fastfood=NA
business$American=NA
business$Asian=NA
business$European=NA

ff <- data.frame()
mex <- data.frame()
asia <- data.frame()
america <- data.frame()
european <- data.frame()

for (i in 1:length(business$categories)){
  ff<-rbind(ff,ifelse("Fast Food" %in% business$categories[[i]],1,0))
  mex <- rbind(mex, ifelse("Mexican" %in% business$categories[[i]],1,0))
  asia <- rbind(asia,ifelse("Chinese" %in% business$categories[[i]]|"Japanese" %in% business$categories[[i]]|"Asian Fusion" %in% business$categories[[i]]|"Thai" %in% business$categories[[i]]|"Indian" %in% business$categories[[i]]|"Taiwanese" %in%  business$categories[[i]]|"Vietnamese" %in% business$categories[[i]]|"Korean" %in% business$categories[[i]],1,0))
  america <- rbind(america,ifelse("American (New)" %in% business$categories[[i]]|"American (Traditional)" %in% business$categories[[i]],1,0))
  european <- rbind(european,ifelse("Italian" %in% business$categories[[i]]|"French" %in% business$categories[[i]]|"Irish" %in% business$categories[[i]]|"Greek" %in% business$categories[[i]]|"Austrian" %in% business$categories[[i]]|"British" %in% business$categories[[i]]|"German" %in% business$categories[[i]],1,0))
}

colnames(ff)[1]<- "Fastfood"
colnames(asia)[1]<-"Asia"
colnames(mex)[1] <- "Mexico"
colnames(america)[1] <- "America"
colnames(european)[1] <- "European"

business$Fastfood <- ff$Fastfood
business$Asian <-asia$Asia
business$Mexican <-mex$Mexico
business$American <- america$America
business$European <-european$European

rm(ff)
rm(asia)
rm(mex)
rm(america)
rm(european)

Here we add columns of features into the business data frame.

All features of interest are in the attributes column inside the business data frame. We select reservation, delivery, card, takeout and parking. There are some other potential features and amenities that can affect the star rating. However, they are a lot of missing values, so we decide to filter them out.

The attribute column is itself a data frame, so we also use similar approach of creating a temporary dataframe and then later replace the column with the empty column of each feature in the business data frame. They are binary variable, and we replace some NA values with zero.

#Attributes: wifi, smoking, reservation, delivery, parking, credit card, happy hour, take out, drive through
business$reservation <- NA
business$delivery <- NA
business$card <- NA
business$takeout <- NA
business$parking <- NA

#save all feature as a seperate data frame
tempA <- data.frame()

#reservation- false 1836, true 2036
tempA <- ifelse(business$attributes$`Takes Reservations`==TRUE,1,0) %>% as.data.frame()
colnames(tempA)[1]<- "reservation"

#delivery- false 2863, true 966
tempA$delivery <- NA
tempA$delivery <- ifelse(business$attributes$Delivery==TRUE,1,0)

#credit card- false 345, true 3792
tempA$card <- NA
tempA$card <- ifelse(business$attributes$`Accepts Credit Cards`==TRUE,1,0)

#take out- false 627, true 3321
tempA$takeout <- NA
tempA$takeout <- ifelse(business$attributes$`Take-out`==TRUE,1,0)

# false 1499, true 1945
tempA$parking <- NA
tempA$parking <- ifelse(business$attributes$Parking$garage==TRUE|business$attributes$Parking$street==TRUE|business$attributes$Parking$validated==TRUE|business$attributes$Parking$lot==TRUE|business$attributes$Parking$valet==TRUE,1,0)

tempA <- tempA %>% replace_na(list(reservation=0,delivery=0,card=0,takeout=0,parking=0))

business$reservation <- tempA$reservation
business$delivery <- tempA$delivery
business$card <- tempA$card
business$takeout <- tempA$takeout
business$parking <- tempA$parking

rm(tempA)

Now we add some attributes from review dataset, including number of votes funny, useful, and cool

We use group by and sum up the number of votes funny, useful, and cool of each business ID in the review data frame. We select only columns that contain potential predictor in our model in the business data to delete and clean up other columns of list and data frame type in business. We call the new data frame is business_data. Then, we join the new data frame with columns of votes funny, useful, and cool from the review data.

# select some columns of review and make it not a data.frame
tempB <- review%>%select(business_id,stars)
tempB <- tempB%>%mutate(vote_funny=review$votes$funny,
                        vote_useful=review$votes$useful,
                        vote_cool=review$votes$cool)
#group_by business 
r<-tempB%>%
  group_by(business_id)%>%
  summarise(votes_funny=sum(vote_funny),
            votes_useful=sum(vote_useful), 
            votes_cool=sum(vote_cool))
rm(tempB)

# select columns in business
business_data <- business%>%
  select(business_id,review_count,
         state,stars,reservation,delivery,card,takeout,parking,Mexican,American,Fastfood,  European, Asian)

#left_join the review columns with the business file 
business_data <- left_join(business_data,r,by="business_id")

Now we add some text analysis from review data and finalize data wrangling part.

We use the library tidytext to tokenize words from the review text column for each business ID. Then, we also match these word with the sentiments to find positive and negative words. We calculat the positivity of review text based on the number of positive and negative words. Last, we join the column of positivity into the data frame business_data. The positivity is one of the potential predictors in our predictive model.

We chang the state into binary variable of whether each business ID belongs to the state PA or QC. We delet the business ID column in order to build predictive models with other features.

library(tidytext)

review_text<-review %>%
    select(business_id, text, stars)%>%
    unnest_tokens(word, text)%>%
    filter(!word %in% stop_words$word)

nrc <- sentiments %>%
    filter(lexicon == "nrc") %>%
    select(word, sentiment)

review_sentiment <- review_text %>%
    inner_join(nrc) %>%
     count(business_id,sentiment)%>%
    spread(sentiment, n, fill = 0) %>%
    mutate(positivity = (positive - negative) / (positive + negative + 1))
## Joining by: "word"
# join with business_data
# use positivity
business_data <- review_sentiment%>%
  select(c(business_id,positivity))%>%
  left_join(business_data,by="business_id")

# state
business_data<- business_data%>%
  mutate(PA=ifelse(state=="PA",1,0))

business_data <- subset(business_data, select = -c(business_id) )

Analysis

Exploratory Analysis (Plots)

We creat plots to explore and visualize the effect of some features on star rating. First, we would like to examine how difference in the text review of star rating of 1 versus star rating of 5. Second, we explore the review count distribution of each star categories for PA versus QC. We also create the density plot to visualize the distribution of positivity of each star category for two states.

Cross validation

We divide the data into training and testing sets randomly, with 80% of the data for training and 20% for testing. We use the training set to build the predictive models and evaluate the model with the testing set.

# cross validation
#library(caret)
#delete the orginal state column
business_data <- business_data %>% select(-c(state))

# create train set and test set with 80% in train and 20% in test
inTrain <- createDataPartition(y = business_data$stars, p=0.8)
training <- slice(business_data, inTrain$Resample1)
testing <- slice(business_data, -inTrain$Resample1)

#run 10 cross validations leaving out 20% of the data
control <- trainControl(method='cv', number=10, p=.8) 

Fitting different models

Linear model

We first build the linear regression with all potential predictors, and we use backward selection to choose significant features that related to star rating. These features will be used to build other prediction models to figure out the most effective model for the prediction of star rating of restaurants.

res_lm1 <- training%>%
  lm(stars ~ ., data = .)

summary(res_lm1)
## 
## Call:
## lm(formula = stars ~ ., data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.55741 -0.36431  0.02322  0.39576  3.03064 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8954782  0.0365138  79.298  < 2e-16 ***
## positivity    1.7502085  0.0538268  32.516  < 2e-16 ***
## review_count -0.0020195  0.0005336  -3.784 0.000157 ***
## reservation  -0.0379047  0.0235112  -1.612 0.107004    
## delivery     -0.0701267  0.0265354  -2.643 0.008258 ** 
## card         -0.1446895  0.0292425  -4.948 7.84e-07 ***
## takeout       0.0997310  0.0252448   3.951 7.94e-05 ***
## parking       0.0955866  0.0239228   3.996 6.58e-05 ***
## Mexican      -0.0612688  0.0580181  -1.056 0.291024    
## American     -0.1876667  0.0396348  -4.735 2.27e-06 ***
## Fastfood     -0.3562367  0.0534471  -6.665 3.04e-11 ***
## European     -0.0044251  0.0288102  -0.154 0.877938    
## Asian        -0.1456820  0.0278989  -5.222 1.87e-07 ***
## votes_funny  -0.0164573  0.0020954  -7.854 5.25e-15 ***
## votes_useful -0.0014164  0.0010185  -1.391 0.164379    
## votes_cool    0.0210122  0.0018720  11.224  < 2e-16 ***
## PA           -0.0437980  0.0254378  -1.722 0.085196 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.609 on 3664 degrees of freedom
## Multiple R-squared:  0.3125, Adjusted R-squared:  0.3095 
## F-statistic: 104.1 on 16 and 3664 DF,  p-value: < 2.2e-16
#remove European
res_lm2 <- training%>%
  lm(stars ~ positivity + review_count + reservation + delivery + card +  takeout + parking+ Mexican + American + Fastfood + Asian + votes_funny + votes_useful + votes_cool + PA, data = .)

summary(res_lm2) 
## 
## Call:
## lm(formula = stars ~ positivity + review_count + reservation + 
##     delivery + card + takeout + parking + Mexican + American + 
##     Fastfood + Asian + votes_funny + votes_useful + votes_cool + 
##     PA, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.55719 -0.36450  0.02278  0.39499  3.03106 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8945465  0.0360016  80.400  < 2e-16 ***
## positivity    1.7499240  0.0537877  32.534  < 2e-16 ***
## review_count -0.0020211  0.0005335  -3.789 0.000154 ***
## reservation  -0.0385480  0.0231321  -1.666 0.095713 .  
## delivery     -0.0699413  0.0265044  -2.639 0.008354 ** 
## card         -0.1447883  0.0292315  -4.953 7.63e-07 ***
## takeout       0.1002070  0.0250505   4.000 6.45e-05 ***
## parking       0.0954416  0.0239010   3.993 6.65e-05 ***
## Mexican      -0.0603400  0.0576944  -1.046 0.295697    
## American     -0.1867614  0.0391888  -4.766 1.96e-06 ***
## Fastfood     -0.3556403  0.0532987  -6.673 2.89e-11 ***
## Asian        -0.1446078  0.0270045  -5.355 9.08e-08 ***
## votes_funny  -0.0164585  0.0020951  -7.856 5.18e-15 ***
## votes_useful -0.0014147  0.0010183  -1.389 0.164816    
## votes_cool    0.0210147  0.0018717  11.228  < 2e-16 ***
## PA           -0.0438225  0.0254339  -1.723 0.084974 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6089 on 3665 degrees of freedom
## Multiple R-squared:  0.3125, Adjusted R-squared:  0.3096 
## F-statistic:   111 on 15 and 3665 DF,  p-value: < 2.2e-16
#remove PA
res_lm3 <- training%>%
  lm(stars ~ positivity + review_count + reservation + delivery + card +  takeout + parking+ Mexican + American + Fastfood + Asian + votes_funny + votes_useful + votes_cool, data = .)

summary(res_lm3) 
## 
## Call:
## lm(formula = stars ~ positivity + review_count + reservation + 
##     delivery + card + takeout + parking + Mexican + American + 
##     Fastfood + Asian + votes_funny + votes_useful + votes_cool, 
##     data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.53683 -0.36938  0.02371  0.39879  3.04413 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8883638  0.0358320  80.609  < 2e-16 ***
## positivity    1.7498415  0.0538022  32.524  < 2e-16 ***
## review_count -0.0020539  0.0005333  -3.852 0.000119 ***
## reservation  -0.0289384  0.0224557  -1.289 0.197587    
## delivery     -0.0726084  0.0264662  -2.743 0.006110 ** 
## card         -0.1469778  0.0292117  -5.031 5.10e-07 ***
## takeout       0.0955685  0.0249121   3.836 0.000127 ***
## parking       0.0858846  0.0232547   3.693 0.000225 ***
## Mexican      -0.0638218  0.0576745  -1.107 0.268545    
## American     -0.2075151  0.0373018  -5.563 2.84e-08 ***
## Fastfood     -0.3622008  0.0531768  -6.811 1.13e-11 ***
## Asian        -0.1433648  0.0270021  -5.309 1.17e-07 ***
## votes_funny  -0.0167142  0.0020904  -7.996 1.71e-15 ***
## votes_useful -0.0017024  0.0010047  -1.694 0.090288 .  
## votes_cool    0.0217212  0.0018267  11.891  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6091 on 3666 degrees of freedom
## Multiple R-squared:  0.3119, Adjusted R-squared:  0.3093 
## F-statistic: 118.7 on 14 and 3666 DF,  p-value: < 2.2e-16
#remove Mexican
res_lm4 <- training%>%
  lm(stars ~ positivity + review_count + reservation + delivery + card +  takeout + parking+ American + Fastfood + Asian + votes_funny + votes_useful + votes_cool, data = .)

summary(res_lm4) 
## 
## Call:
## lm(formula = stars ~ positivity + review_count + reservation + 
##     delivery + card + takeout + parking + American + Fastfood + 
##     Asian + votes_funny + votes_useful + votes_cool, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.53393 -0.37212  0.02492  0.40010  3.04813 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8866496  0.0357996  80.634  < 2e-16 ***
## positivity    1.7507210  0.0537979  32.543  < 2e-16 ***
## review_count -0.0020523  0.0005333  -3.848 0.000121 ***
## reservation  -0.0287027  0.0224553  -1.278 0.201256    
## delivery     -0.0708586  0.0264198  -2.682 0.007350 ** 
## card         -0.1468170  0.0292123  -5.026 5.25e-07 ***
## takeout       0.0935546  0.0248463   3.765 0.000169 ***
## parking       0.0845697  0.0232250   3.641 0.000275 ***
## American     -0.2046161  0.0372108  -5.499 4.08e-08 ***
## Fastfood     -0.3622956  0.0531784  -6.813 1.11e-11 ***
## Asian        -0.1407912  0.0269026  -5.233 1.76e-07 ***
## votes_funny  -0.0167349  0.0020903  -8.006 1.58e-15 ***
## votes_useful -0.0017306  0.0010045  -1.723 0.084989 .  
## votes_cool    0.0217832  0.0018259  11.930  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6091 on 3667 degrees of freedom
## Multiple R-squared:  0.3117, Adjusted R-squared:  0.3092 
## F-statistic: 127.7 on 13 and 3667 DF,  p-value: < 2.2e-16
#remove Reservation
res_lm5 <- training%>%
  lm(stars ~ positivity + review_count + delivery + card +  takeout + parking+ American + Fastfood + Asian + votes_funny + votes_useful + votes_cool, data = .)

summary(res_lm5) 
## 
## Call:
## lm(formula = stars ~ positivity + review_count + delivery + card + 
##     takeout + parking + American + Fastfood + Asian + votes_funny + 
##     votes_useful + votes_cool, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.52721 -0.36992  0.02344  0.39836  3.05218 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.8849845  0.0357790  80.634  < 2e-16 ***
## positivity    1.7430840  0.0534697  32.599  < 2e-16 ***
## review_count -0.0020543  0.0005333  -3.852 0.000119 ***
## delivery     -0.0697065  0.0264067  -2.640 0.008332 ** 
## card         -0.1583434  0.0277880  -5.698 1.31e-08 ***
## takeout       0.0988862  0.0244958   4.037 5.53e-05 ***
## parking       0.0820789  0.0231451   3.546 0.000396 ***
## American     -0.2030006  0.0371925  -5.458 5.13e-08 ***
## Fastfood     -0.3542592  0.0528100  -6.708 2.27e-11 ***
## Asian        -0.1458720  0.0266096  -5.482 4.49e-08 ***
## votes_funny  -0.0165763  0.0020868  -7.943 2.60e-15 ***
## votes_useful -0.0018276  0.0010017  -1.825 0.068146 .  
## votes_cool    0.0218462  0.0018254  11.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6091 on 3668 degrees of freedom
## Multiple R-squared:  0.3114, Adjusted R-squared:  0.3091 
## F-statistic: 138.2 on 12 and 3668 DF,  p-value: < 2.2e-16
#RMSE = SD/(1-R^2)
rmse_lm <- 0.6114*sqrt(1-0.3027) #0.51

Based on the result of the backward selection in linear regression model, we leave out four variables. They are European, PA, Mexican, and reservation.

GLM (Logistic Regression)

We fit other models using the selected variables from the result of backward selection. The first model is logistic regression.

# GLM (Logistic Regression)

fit1 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==1),family="binomial")
fit2 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==1.5),family="binomial")
fit3 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==2),family="binomial")
fit4 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==2.5),family="binomial")
fit5<- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==3),family="binomial")
fit6 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==3.5),family="binomial")
fit7 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==4),family="binomial")
fit8 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==4.5),family="binomial")
fit9 <- glm(y~., data=mutate(training %>% select(-c(European, PA, Mexican, reservation)),
                                   y=stars==5),family="binomial")


f_hat1 <- predict(fit1, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type = "response")
f_hat2 <- predict(fit2, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type ="response")
f_hat3 <- predict(fit3, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type = "response")
f_hat4 <- predict(fit4, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type ="response")
f_hat5 <- predict(fit5, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type = "response")
f_hat6 <- predict(fit6, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type ="response")
f_hat7 <- predict(fit7, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type = "response")
f_hat8 <- predict(fit8, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type ="response")
f_hat9 <- predict(fit9, newdata = testing %>% select(-c(European, PA, Mexican, reservation)), type = "response")
 

pred_glm <- 
  apply(cbind(f_hat1, f_hat2, f_hat3,f_hat4,f_hat5,f_hat6,f_hat7,f_hat8,f_hat9),1,which.max)

KNN

Another model that we try is the K Nearest Neighbors model. We use cross validation to select the best k first, and we find out that k=19 is the best value. We fit the model using k=19.

Due to the random spliting training dataset and testing dataset, the best K value would be changed every time.Here we just use the K=19.

#KNN
res <- train(as.factor(stars) ~ .,
             data = training %>% select(-c(European, PA, Mexican, reservation)),
             method = "knn",
             trControl = control,
             tuneLength = 1, 
             tuneGrid=data.frame(k=seq(1,20,2)),
             metric="Accuracy")

plot(res)

res$results %>% 
  ggplot(aes(k, Accuracy, ymin= Accuracy - AccuracySD, 
             ymax = Accuracy + AccuracySD)) +
  geom_point() + geom_errorbar()

res
## k-Nearest Neighbors 
## 
## 3681 samples
##   12 predictor
##    9 classes: '1', '1.5', '2', '2.5', '3', '3.5', '4', '4.5', '5' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3314, 3314, 3313, 3312, 3313, 3312, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa       Accuracy SD  Kappa SD  
##    1  0.2406964  0.04753534  0.01442082   0.01981020
##    3  0.2529188  0.05708952  0.01890461   0.02669665
##    5  0.2599760  0.05447496  0.01587096   0.02100396
##    7  0.2730180  0.06522813  0.01758798   0.02076092
##    9  0.2719266  0.05794568  0.01631473   0.01969678
##   11  0.2855336  0.07005398  0.02082657   0.02916119
##   13  0.2849842  0.06584916  0.01811627   0.02396622
##   15  0.2800810  0.05739721  0.02058369   0.02827827
##   17  0.2819817  0.05700791  0.01605344   0.02346938
##   19  0.2909588  0.06668215  0.01581188   0.02252475
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 19.
# fit with k=19
fit <- knn3(stars~., data=training %>% select(-c(European, PA, Mexican, reservation)), k=19)
f_hat <- predict(fit, newdata = testing %>% select(-c(European, PA, Mexican, reservation)))
pred_knn_19 <- apply(f_hat,1,which.max)

SVM and Random Forest

We would also like to try to build predictive model with support vector machine and random forest.

#SVM
#install.packages("e1071")

library(e1071)
fit_svm <- svm(stars~.,data=training %>% select(-c(European, PA, Mexican, reservation))) 
pred_svm <-predict(fit_svm,newdata=testing %>% select(-c(European, PA, Mexican, reservation)))

# Random Forest
library(randomForest)
fit_rr<-randomForest(stars~.,data=training %>% select(-c(European, PA, Mexican, reservation)))
pred_rr<-predict(fit_rr,newdata=testing %>% select(-c(European, PA, Mexican, reservation)))

RMSE

For comparing these models we fitted above and finding out the best one, we calculate the root mean squared error of the predicted star rating and the true rating for each model.

RMSE <- function(true_ratings, predicted_ratings){
    sqrt(mean((true_ratings - predicted_ratings)^2))
}

rmse_glm <- RMSE(testing$stars,pred_glm)
rmse_glm 
## [1] 2.963467
rmse_knn_19<- RMSE(testing$stars,pred_knn_19)
rmse_knn_19 
## [1] 2.944845
rmse_svm <- RMSE(testing$stars,pred_svm)
rmse_svm 
## [1] 0.6040334
rmse_rr <- RMSE(testing$stars,pred_rr)
rmse_rr 
## [1] 0.6098809

Table of RMSE Comparison

We create the table of RMSE for each model to compare the result. The logistic regression and k-Nearest neighbors do not perform very well. The RMSE of random forest and support vector machine values are smaller. Apparently the linear model has the smallest RMSE and would be our best model.

library(knitr)
rmse_results <- data_frame(method = "Logistic Regression", RMSE = rmse_glm)

rmse_results <- bind_rows(rmse_results, data_frame(method = "Linear Regression", RMSE = rmse_lm))

rmse_results <- bind_rows(rmse_results,
                          data_frame(method="K Nearest Neighbors",  
                                     RMSE = rmse_knn_19 ))

rmse_results <- bind_rows(rmse_results,
                          data_frame(method="Support Vector Machine",  
                                     RMSE = rmse_svm ))

rmse_results <- bind_rows(rmse_results,
                          data_frame(method="Random Forest",  
                                     RMSE = rmse_rr ))


rmse_results%>%kable
method RMSE
Logistic Regression 2.9634667
Linear Regression 0.5105465
K Nearest Neighbors 2.9448452
Support Vector Machine 0.6040334
Random Forest 0.6098809

Summary and Conclusions

From the project we could answer the questions we wanted to investigate.

We found out that delivery service, take-out, parking, accepting credit cards or not, American food, fast food, Asian food, the positivity of the review text, the counts of the reviews, and the number of the received votes of “useful”, “funny” and “cool” from other users have more effect on the stars of the restaurants.

American food, fast food, and Asian food are more related to the success of the restaurants than other kinds of food.

Not really.

Guests would more like to rate them if they have delivery, take-out, parking, accepting credit cards services.

The more positivity of the review text, the higher ratings of the restaurants.

By comparing the RMSE, we found out that the linear model outperforms other models.